Introduction to Open Data Science - Course Project

Chapter 1: About the project

PHD-302 Introduction to Open Data Science 2023!

Check out my GitHub repository here: https://github.com/annatolo/IODS-project

date()
## [1] "Mon Nov 20 17:51:19 2023"

My Reflections on the Course

  • How am I feeling right now? I’m feeling excited and a bit nervous about starting this course, because everything about it is completely new to me.
  • What do I expect to learn? I’m hoping to learn the basics of R and gain some general understanding of data science.
  • Where did I hear about the course? I stumbled upon the course by chance when scrolling through the curriculum and thought it sounded very interesting and useful.

My Learning Experiences So Far

So far, I have found the R for Health and Data Science book and the Exercise Set 1 to be excellent resources for familiarising myself with R. Since I am a total beginner, Chapter 2 - R basics was my favourite, as comprehending Chapters 3-5 still requires a bit more studying on my part… Everything to do with plots seems to be complicated! However, I think I like working with R Markdown in general.


Chapter 2: Regression and model validation

This week, after completing the data-wrangling exercise, I embarked on a statistical exploration of the students2014 dataset: importing the data, examining its structure, and graphically visualizing student ages and exam scores.

I used histograms to analyze the age distribution and score variability, noting the skewness and outliers that provide insights into the student demographic and academic achievement. Boxplots offered a gender-based comparison of exam points, revealing median performances and exceptional cases.

Then, as per the instructions, I constructed a linear regression model to investigate the influence of students’ learning approaches on exam scores. Despite the model’s modest explanatory power, indicated by an R-squared value of 4.07%, it did reveal some interesting patterns in how the learning approaches—deep, strategic, and surface—relate to academic outcomes.

Finally, I utilized diagnostic plots to validate the regression model’s assumptions, assessing linearity, normality, and the impact of influential data points. These visual tools illustrated the robustness of the model and any potential weaknesses in its fit to the data.

Throughout this process, I enhanced my understanding of data visualization techniques, the interpretation of statistical models, and the critical evaluation of model assumptions. This endeavor sharpened my analytical skills, particularly in applying statistical concepts to real-world educational data using R.

date()
## [1] "Mon Nov 20 17:51:20 2023"

Reading and exploring the data

I’m starting out by reading the students2014 data from my local folder.

students2014 <- read.csv("/Users/annatolonen/Desktop/IODS-project/data/learning2014.csv", sep = ",")

Next, I’m moving onto displaying the structure and dimensions of the dataset.

str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: num  3.9 3.3 3.3 3.7 3.5 3.8 3.7 3.7 3.4 2.9 ...
##  $ Deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ Stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ Surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166   7

Creating graphs

I’ll start the graphical overview of the data by creating histograms for the numerical variables. First, the ages of the students…

hist(students2014$Age)

Description: The histogram illustrates the distribution of ages among the students who participated in the survey. Each bar corresponds to an age interval, showing the number of students that fall into that age bracket.

Interpretation:

  • Central Tendency and Spread: The histogram shows a concentration of students in the younger age groups, particularly in the range of early to mid-20s, which is common for a university setting.
  • Skewness: The distribution appears to be right-skewed, meaning there are fewer older students in the course. The tail extends to the right, indicating that while the majority of students are younger, there are a few students who are significantly older than the average.
  • Outliers: Any bars far to the right might represent outliers or non-traditional students who are older than typical university age.

Next, the points obtained…

hist(students2014$Points)

Description: This histogram shows the distribution of exam points scored by students. Each bar represents the number of students that achieved a score within a specific range.

Interpretation:

  • Central Tendency: The most common score range is centered around 20 to 25 points, indicating that this is where the majority of students’ scores lie.

  • Spread: The distribution of scores spans from approximately 5 points to over 30, showing a wide range in performance among students.

  • Skewness: The distribution appears to be left-skewed, with a tail extending towards the lower end of points. This suggests that while most students scored around the middle range, a smaller number of students scored significantly lower.

  • Outliers: Any bars isolated from the others towards the higher end of the scale could be considered outliers, representing students who scored much higher than their peers. These could be considered positive outliers. In the educational context, such outliers might indicate students who have a particularly strong grasp of the material, possibly due to prior knowledge, natural aptitude, or more effective study strategies. Conversely, isolated bars or data points at the lower end of the scale represent students who scored much lower than the majority. These would be negative outliers. Such outliers could suggest students who may have struggled with the course content or had external factors that negatively impacted their performance.

    • Implications of Outliers: The presence of outliers, especially if there are many or they are extreme, can have implications for teaching and learning. For example, it might prompt an instructor to consider whether the course materials are accessible to all students or whether additional support could be offered. It might also reflect the need for course content adjustments or highlight the presence of particularly challenging topics that could be addressed differently in the future.

    • Statistical Considerations: From a statistical perspective, outliers can affect the mean of the data, potentially skewing it away from the median. They can also impact the assumptions of certain statistical tests and models. For example, in regression analysis, outliers can disproportionately influence the slope of the regression line and the overall fit of the model, leading to misleading interpretations.
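The mean-versus-median point is easy to demonstrate with a small sketch. The scores below are made up for illustration (they are not taken from students2014): a single extreme value shifts the mean noticeably while the median barely moves.

```r
# Hypothetical exam scores (illustration only, not from students2014)
scores       <- c(18, 20, 21, 22, 23, 24, 25)
with_outlier <- c(scores, 60)   # add one extreme high score

mean(scores)         # ~21.9
median(scores)       # 22
mean(with_outlier)   # ~26.6: the single outlier pulls the mean up
median(with_outlier) # 22.5: the median is barely affected
```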

For comparing the distribution of exam points between female (F) and male (M) students, I’m creating a boxplot graph.

boxplot(Points ~ gender, data = students2014)

Description: The boxplot displays the distribution of exam points for students, segregated by gender. The central box of each plot represents the interquartile range (IQR) where the middle 50% of scores lie. The line within each box indicates the median score. The “whiskers” extend to the furthest points that are not considered outliers, and any points beyond the whiskers are plotted individually as potential outliers.

Interpretation:

  • Central Tendency: The median scores, marked by the lines in the boxes, indicate the central tendency of exam points for each gender group. They appear to be similar for both groups, suggesting that median performance on the exam is not substantially different between genders.
  • Variability: The IQRs, represented by the height of the boxes, show the spread of the middle 50% of the scores. It seems that there is a similar range of scores for both genders, indicating comparable variability in exam performance.
  • Outliers: Any individual points that appear as dots outside of the whiskers are potential outliers. In the case of the female group, there’s at least one score that is notably lower than the rest, signifying an outlier who scored significantly lower than other students.
  • Overall Distribution: The absence of outliers in the male group and the presence of outliers in the female group could be worth investigating further. It might suggest individual cases where additional support could be beneficial.

Generally, if the goal is to ensure equitable outcomes across genders, the similarity in median and IQR might be encouraging, but the presence of outliers in the female group might warrant a closer look to understand their causes. It’s also important to note that boxplots do not show the distribution’s modality or skewness; hence, the presence of outliers does not necessarily imply a skewed distribution.

Now, to show the relationship between students’ ages and their exam points, I’m creating a scatter plot.

plot(students2014$Age, students2014$Points)

Description: The scatter plot visualizes each student’s exam points against their age. Each dot represents a student, with the horizontal position indicating their age and the vertical position indicating the number of points they scored on the exam.

Interpretation:

  • Trend: The plot does not appear to show a clear linear relationship between age and exam points. The points are dispersed throughout, suggesting no strong correlation between a student’s age and their performance on the exam.
  • Clustering: Most of the data points are clustered in the lower age range, reflecting the typical age demographic of university students. There’s a high density of points where the age is between 20 and 30, which corresponds to the traditional age range for undergraduate students.
  • Outliers: There are some data points spread out across higher ages, representing older students. There do not seem to be any obvious patterns or anomalies with respect to their exam points when compared to the younger students.
  • Variability: There is variability in the exam points across all age groups, which seems consistent. This indicates that factors other than age are likely to be more predictive of exam performance.

Interestingly, when considering the potential impact of maturity and life experience on academic performance, as suggested by the age variable in the scatter plot, the lack of a clear trend may indicate that these factors do not have a straightforward or linear relationship with exam scores in the context of this course.
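As a numerical companion to the scatter plot, the Pearson correlation between age and points could be computed directly; this sketch assumes the students2014 data frame loaded earlier, and a coefficient near zero would back up the visual impression of no linear trend.

```r
# Pearson correlation between age and exam points
# (assumes students2014 was read in as shown above)
cor(students2014$Age, students2014$Points)
```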

Summarizing each variable

summary(students2014)
##     gender               Age           Attitude          Deep      
##  Length:166         Min.   :17.00   Min.   :1.300   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:3.025   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.400   Median :3.667  
##                     Mean   :25.51   Mean   :3.381   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.775   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       Stra            Surf           Points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

The students2014 dataset reflects a student group with an age range from 17 to 55, indicating diversity in student demographics, though the average age is about 25, suggesting a predominantly young adult cohort. Attitudes towards statistics vary but generally lean positive, with a median score of 3.4 out of 5. Learning approaches show a tendency towards deeper, more strategic engagement, with less emphasis on surface-level learning, as indicated by the higher median scores for Deep and Stra and a lower median for Surf. Exam points are distributed across a wide range, from 7 to 33, with a median of 23, suggesting varied academic performance among the students.

Regression model using three variables of my choosing

Now, I’m choosing three variables as explanatory variables and fitting a regression model where exam points is the target (dependent, outcome) variable. Given that ‘Deep’, ‘Stra’, and ‘Surf’ represent different learning approaches and could potentially influence exam performance, they seem like reasonable choices for explanatory variables.

Fitting the linear regression model and showing its summary

model <- lm(Points ~ Deep + Stra + Surf, data = students2014)

summary(model)
## 
## Call:
## lm(formula = Points ~ Deep + Stra + Surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1235  -3.0737   0.5226   4.2799  10.3229 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.9143     5.1169   5.260  4.5e-07 ***
## Deep         -0.7443     0.8662  -0.859   0.3915    
## Stra          0.9878     0.5962   1.657   0.0994 .  
## Surf         -1.6296     0.9153  -1.780   0.0769 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.827 on 162 degrees of freedom
## Multiple R-squared:  0.04071,    Adjusted R-squared:  0.02295 
## F-statistic: 2.292 on 3 and 162 DF,  p-value: 0.08016

Interpreting the summary of my fitted model

To interpret the results, I’m applying the following principles:

  • Coefficients: Looking at the estimate for each variable to understand its impact on the exam points. A positive coefficient suggests that higher values of the variable are associated with higher exam points, while a negative coefficient suggests the opposite.
    • The coefficient for ‘Deep’ is -0.7443, indicating that, holding all other variables constant, a one-unit increase in the Deep score is associated with a 0.7443 point decrease in the exam score. However, this relationship is not statistically significant (p = 0.3915), suggesting that the Deep learning approach does not have a significant impact on exam points.
    • The coefficient for ‘Stra’ is 0.9878, implying that a more strategic approach to learning is associated with an increase in exam points. This coefficient approaches statistical significance (p = 0.0994), hinting at a potential positive relationship.
    • The coefficient for ‘Surf’ is -1.6296, suggesting that a higher surface approach score is associated with lower exam points. This coefficient is marginally significant (p = 0.0769), suggesting there may be some relationship between a surface learning approach and lower exam performance.
  • Statistical Significance: The p-values associated with each coefficient will indicate whether the variables have a statistically significant relationship with the exam points. Typically, a p-value less than 0.05 is considered statistically significant.
    • None of the p-values for the learning approaches are below the conventional significance threshold of 0.05, although Stra and Surf have p-values close to this cutoff, indicating that there might be a relationship worth further investigation.
  • Model Fit: The R-squared value indicates how well the model explains the variability in the exam points. An R-squared value closer to 1 means the model explains a large portion of the variability.
    • Residuals: The residuals, which are the differences between the observed values and the values predicted by the model, range from -15.12 to 10.32. The median being close to 0 suggests the model is roughly centered on the data, but the wide range indicates there is considerable variability that the model is not capturing.
    • R-squared: The R-squared value of 0.04071 means that approximately 4.07% of the variance in exam points is explained by the model. This is a relatively low value, suggesting that the model does not fit the data very well. It’s indicating that the combination of ‘Deep’, ‘Stra’, and ‘Surf’ learning approaches explains only a small portion of the variation in students’ exam points. In other words, most of the variability in exam points is due to other factors not included in the model.
    • Adjusted R-squared: The adjusted R-squared is 0.02295, which adjusts the R-squared value based on the number of predictors in the model relative to the number of observations. It is also quite low, reinforcing the point that the model’s explanatory power is limited.
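For reference, these fit statistics can be extracted from the summary object programmatically instead of read off the printout; this sketch assumes the ‘model’ object fitted above is still in the workspace.

```r
# Pulling fit statistics out of the lm summary object
# (assumes 'model' is the linear model fitted above)
s <- summary(model)
s$r.squared      # multiple R-squared, ~0.0407
s$adj.r.squared  # adjusted R-squared, ~0.0230
s$sigma          # residual standard error, ~5.83
```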

Producing diagnostic plots

Generating diagnostic plots for the linear regression model and setting up the plotting area to display four plots (2x2 layout):

par(mfrow = c(2, 2))
plot(model)

Explanations for the plots above:

  1. Residuals vs Fitted Values: Helps check the linearity and homoscedasticity assumptions.
  2. Normal Q-Q Plot: Helps check the normality assumption of residuals.
  3. Scale-Location (or Spread-Location) Plot: Another check for homoscedasticity.
  4. Residuals vs Leverage Plot: Helps identify influential observations.

Interpreting the plots:

  1. Residuals vs Fitted Values:
  • What We See: The residuals do not show a clear pattern around the horizontal line, but there seems to be a slight funnel shape, with a spread that increases for higher fitted values.
  • Interpretation: The lack of a clear pattern suggests that the relationship between the learning approaches and exam points is linear. However, the presence of a slight funnel shape might indicate that the variance of exam scores increases as the average score increases, which could be indicative of a more complex relationship at higher scores that isn’t captured by a simple linear model.
  2. Normal Q-Q Plot:
  • What We See: Most points lie on or very close to the line, but there are some deviations at the tails.
  • Interpretation: The residuals mostly follow the expected line of a normal distribution, indicating that the assumption of normality is reasonable for this data. The deviations at the tails might represent a small number of students with unusual score patterns, which could be further explored to understand their impact on the overall model.
  3. Scale-Location Plot:
  • What We See: The points are spread somewhat uniformly along the range of fitted values, although there might be a slight increase in spread on the right side.
  • Interpretation: The relatively uniform spread of residuals across the range of fitted values suggests that the variance of exam scores is consistent across different average score levels, although the slight increase on the right may require further investigation, perhaps looking into whether students with higher average scores show more variability in their performance.
  4. Residuals vs Leverage Plot:
  • What We See: There are no points with high leverage and large residuals. The Cook’s distance lines do not indicate any points with a particularly high influence on the model.
  • Interpretation: The plot shows that no students have an undue influence on the model, indicating that the model’s findings are not being driven by a few atypical students. This suggests that the conclusions drawn from the model about the relationship between learning approaches and exam scores are likely to be robust to the influence of individual students.
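The Residuals vs Leverage impression can be double-checked numerically with Cook’s distance; a common rule of thumb flags observations above 4/n. The sketch below uses R’s built-in cars dataset as a stand-in, but the same lines would apply unchanged to the ‘model’ fitted above.

```r
# Cook's distance check for influential observations (4/n rule of thumb).
# Built-in 'cars' data is used here as a stand-in for the students2014 model.
fit <- lm(dist ~ speed, data = cars)
cd  <- cooks.distance(fit)
which(cd > 4 / length(cd))   # indices of potentially influential points
```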

Chapter 3: Logistic regression

date()
## [1] "Mon Nov 20 17:51:20 2023"

Reading and exploring the data (Analysis step 2)

I’m starting out by reading the joined student performance data from my local folder (and loading the necessary libraries).

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
joined_data <- read.csv("/Users/annatolonen/Desktop/IODS-project/data/joined_and_modified_data.csv")

Next, I’m printing the names of the variables.

print(names(joined_data))
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Description of the dataset: The dataset used in this chapter is a compilation of information from two separate studies on student performance from secondary schools in Portugal, which has been merged to provide a comprehensive look at various factors that might impact student success. It includes demographic details, academic achievements, and specific lifestyle choices, with a particular focus on alcohol consumption patterns. The inclusion of variables such as average alcohol use (alc_use) and a binary indicator of high alcohol consumption (high_use) allows for an analysis of the potential influence of alcohol on academic performance. This dataset can be analysed to understand and possibly predict student performance in relation to their personal and social habits.

Choosing 4 variables and hypothesising (Analysis step 3)

  1. Academic performance (‘G3’): Here I am assuming that the final grade can be considered a strong indication of overall academic achievement.
  • My hypothesis: Higher alcohol consumption (as indicated by alc_use and high_use) is negatively associated with academic performance. Students with high alcohol consumption will have lower final grades (G3) compared to their counterparts who consume less alcohol.
  2. Free Time After School (‘freetime’): This variable indicates the amount of free time students have after school.
  • My hypothesis: Students with more free time may have higher alcohol consumption. An abundance of unstructured time could lead to increased social activities where alcohol is present, thereby increasing alc_use.
  3. Going Out With Friends (‘goout’): This indicates how often students go out with their friends.
  • My hypothesis: There is a positive correlation between the frequency of going out with friends and alcohol consumption. Students who go out more often are more likely to engage in social drinking, thus having a higher alc_use score.
  4. Study Time (‘studytime’): This variable reflects the amount of time spent studying every week.
  • My hypothesis: Students who dedicate more time to studying will have lower alcohol consumption. A greater commitment to academics could correlate with more responsible drinking habits and a lower high_use designation.

Numerically and graphically exploring the chosen variables (Analysis step 4)

# Numerically exploring the variables using summary for each
summary(joined_data$G3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   12.00   11.52   14.00   18.00
summary(joined_data$freetime) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.000   3.224   4.000   5.000
summary(joined_data$goout)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.116   4.000   5.000
summary(joined_data$studytime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.043   2.000   4.000
  • The average final grade (G3) is approximately 11.52 out of 20.
  • On average, students have a moderate amount of free time after school (freetime), with a mean of 3.22 on a scale of 1 to 5.
  • Students tend to go out with friends (goout) relatively frequently, with an average score of 3.12.
  • Average study time (studytime) is around 2.04, which suggests a few hours of study per week.
  • The mean alcohol consumption (alc_use) score is 1.89, which indicates low to moderate drinking on average.
# Cross-tabulations of 'high_use' with 'freetime', 'goout', and 'studytime'
table(joined_data$high_use, joined_data$freetime)
##        
##           1   2   3   4   5
##   FALSE  15  46 112  65  21
##   TRUE    2  14  40  40  15
table(joined_data$high_use, joined_data$goout)
##        
##          1  2  3  4  5
##   FALSE 19 82 97 40 21
##   TRUE   3 15 23 38 32
table(joined_data$high_use, joined_data$studytime)
##        
##           1   2   3   4
##   FALSE  56 128  52  23
##   TRUE   42  57   8   4
  • Students with less free time tend to have lower alcohol consumption (high_use is less frequent with a freetime score of 1).
  • Higher levels of going out (goout) are associated with higher alcohol consumption.
  • More study time (studytime) is associated with lower instances of high alcohol consumption (high_use).
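Because the goout groups differ in size, column-wise proportions read more easily than raw counts. The sketch below re-enters the goout cross-tabulation printed above by hand, so it runs standalone; with the data loaded, prop.table(table(joined_data$high_use, joined_data$goout), margin = 2) gives the same table.

```r
# goout counts copied from the cross-tabulation above
counts <- matrix(c(19, 3, 82, 15, 97, 23, 40, 38, 21, 32), nrow = 2,
                 dimnames = list(high_use = c("FALSE", "TRUE"), goout = 1:5))

# Share of high users within each goout level: rises from ~0.14 to ~0.60
round(prop.table(counts, margin = 2)["TRUE", ], 2)
```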

Next, for the graphical exploration, I’m going to create bar plots for the variables ‘freetime’, ‘goout’ and ‘studytime’ against ‘alc_use’.

ggplot(joined_data, aes(x = freetime, y = alc_use)) +
  geom_bar(stat = "summary", fun = "mean") +
  labs(title = "Average Alcohol Consumption by Free Time", x = "Free Time", y = "Average Alcohol Use")

ggplot(joined_data, aes(x = goout, y = alc_use)) +
  geom_bar(stat = "summary", fun = "mean") +
  labs(title = "Average Alcohol Consumption by Going Out", x = "Going Out", y = "Average Alcohol Use")

ggplot(joined_data, aes(x = studytime, y = alc_use)) +
  geom_bar(stat = "summary", fun = "mean") +
  labs(title = "Average Alcohol Consumption by Study Time", x = "Study Time", y = "Average Alcohol Use")

  • These bar plots indicate that average alcohol consumption increases with more free time and higher frequency of going out. This supports my hypothesis that more unstructured time and social activities could lead to higher alcohol consumption.
  • Conversely, there is an apparent decrease in average alcohol consumption with increased study time, which aligns with the hypothesis that students who dedicate more time to studying may drink less.

To explore academic performance (‘G3’) against ‘high_use’, I’m going to create a box plot.

ggplot(joined_data, aes(x = as.factor(high_use), y = G3)) +
  geom_boxplot() +
  labs(title = "Final Grade (G3) by High Alcohol Consumption", x = "High Alcohol Use", y = "Final Grade (G3)")

  • This box plot shows that students with high alcohol consumption tend to have a wider range and generally lower median grades compared to those with low alcohol consumption. This observation supports the hypothesis of a negative association between alcohol consumption and academic performance.

Overall, the results of the exploration align with my previously stated hypotheses. There is an indication that higher alcohol consumption is related to lower academic performance, more free time, increased socialization with friends, and less study time.

Logistic regression analysis (Analysis step 5)

# Fitting the logistic regression model using 'high_use' as the response variable
# and G3, freetime, goout, and studytime as predictors
model <- glm(high_use ~ G3 + freetime + goout + studytime, 
             data = joined_data, family = "binomial")

# Getting a summary of the fitted model
summary(model)
## 
## Call:
## glm(formula = high_use ~ G3 + freetime + goout + studytime, family = "binomial", 
##     data = joined_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.95210    0.76876  -2.539 0.011108 *  
## G3          -0.03499    0.03931  -0.890 0.373397    
## freetime     0.10832    0.13728   0.789 0.430067    
## goout        0.70101    0.12248   5.723 1.04e-08 ***
## studytime   -0.58983    0.16798  -3.511 0.000446 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 386.02  on 365  degrees of freedom
## AIC: 396.02
## 
## Number of Fisher Scoring iterations: 4
# Calculating and printing the odds ratios and confidence intervals for the coefficients
exp_coef <- exp(coef(model))
exp_confint <- exp(confint(model))
## Waiting for profiling to be done...
# Creating a data frame to nicely format the odds ratios and their confidence intervals
odds_ratio_df <- data.frame(
  Variable = names(exp_coef),
  OddsRatio = exp_coef,
  Lower95CI = exp_confint[,1],
  Upper95CI = exp_confint[,2]
)

# Printing the odds ratios and confidence intervals
print(odds_ratio_df)
##                Variable OddsRatio  Lower95CI Upper95CI
## (Intercept) (Intercept) 0.1419750 0.03046126 0.6255685
## G3                   G3 0.9656148 0.89381353 1.0432400
## freetime       freetime 1.1144073 0.85141137 1.4606593
## goout             goout 2.0157819 1.59543764 2.5817002
## studytime     studytime 0.5544194 0.39446586 0.7636077
# install.packages("broom")
library(broom)

# Tidying the model to include exponentiated coefficients (odds ratios) and confidence intervals
tidy_model <- tidy(model, exponentiate = TRUE, conf.int = TRUE)

# Printing the tidy model with odds ratios
print(tidy_model)
## # A tibble: 5 × 7
##   term        estimate std.error statistic      p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>    <dbl>     <dbl>
## 1 (Intercept)    0.142    0.769     -2.54  0.0111         0.0305     0.626
## 2 G3             0.966    0.0393    -0.890 0.373          0.894      1.04 
## 3 freetime       1.11     0.137      0.789 0.430          0.851      1.46 
## 4 goout          2.02     0.122      5.72  0.0000000104   1.60       2.58 
## 5 studytime      0.554    0.168     -3.51  0.000446       0.394      0.764

Now, to interpret the results:

  • Intercept: The odds ratio is 0.142, which represents the odds of high alcohol consumption when all other variables are held at zero. The p-value is significant (p = 0.011), suggesting that the model intercept is significantly different from zero.
  • G3 (Final Grade):
    • Odds Ratio (OR): 0.966, suggesting a slight decrease in the odds of high alcohol consumption with better academic performance, although this is a very small effect.
    • 95% Confidence Interval (CI): Ranges from 0.894 to 1.043, which includes 1, indicating the effect might not be statistically significant.
    • P-value: 0.373, which is greater than the conventional threshold of 0.05 for statistical significance, confirming that the effect of final grades on high alcohol use is not statistically significant.
  • Freetime:
    • OR: 1.114, suggesting a slight increase in the odds of high alcohol consumption with more free time.
    • CI: Ranges from 0.851 to 1.461, which includes 1, indicating the effect might not be statistically significant.
    • P-value: 0.430, which is not statistically significant.
  • Going Out (Goout):
    • OR: 2.016, indicating that the odds of high alcohol consumption are about twice as high for each additional unit increase in the goout score.
    • CI: Ranges from 1.595 to 2.581, which does not include 1, indicating that this effect is statistically significant.
    • P-value: 1.04e-08, which is highly statistically significant, providing strong evidence that going out with friends is positively associated with high alcohol consumption.
  • Study Time:
    • OR: 0.554, suggesting that more study time is associated with lower odds of high alcohol consumption.
    • CI: Ranges from 0.395 to 0.764, which does not include 1, indicating that this effect is statistically significant.
    • P-value: 0.000446, which is also highly statistically significant, supporting the hypothesis that increased study time is associated with lower alcohol consumption.

In summary, the logistic regression analysis indicates that two of the variables, goout and studytime, have statistically significant associations with high alcohol consumption. Students who go out more often are more likely to have high alcohol consumption, while students who spend more time studying are less likely to. These results are in line with my hypotheses for these variables. However, academic performance (G3) and the amount of free time students have (freetime) do not show statistically significant effects on high alcohol consumption in this analysis.
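For reference, the odds ratios and confidence intervals discussed above can be pulled out of a fitted glm in a couple of lines. Since the model object for joined_data isn't available in this chunk, the sketch below illustrates the idea on the built-in mtcars data, using Wald-type intervals via confint.default (the exercise may have used profile-likelihood intervals, which differ slightly):

```r
# Sketch: odds ratios and 95% CIs from a logistic glm.
# Illustrated on built-in mtcars data; for the analysis above the model would be
# glm(high_use ~ G3 + freetime + goout + studytime, data = joined_data, family = "binomial")
m <- glm(am ~ hp + wt, data = mtcars, family = "binomial")
odds_ratios <- exp(coef(m))            # exponentiated coefficients = odds ratios
conf_int    <- exp(confint.default(m)) # Wald-type 95% CIs on the odds-ratio scale
round(cbind(OR = odds_ratios, conf_int), 3)
```

As above, an odds ratio whose interval contains 1 cannot be called statistically significant at the 5% level.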

Exploring the predictive power of my model (Analysis step 6)

Now, I’m fitting my logistic regression model using only the predictors that were determined to be significant above.

# Fitting the logistic regression model with significant predictors
model <- glm(high_use ~ goout + studytime, data = joined_data, family = "binomial")

# Making predictions on the training data
joined_data$predicted_high_use <- if_else(predict(model, type = "response") > 0.5, TRUE, FALSE)

# Creating a 2x2 cross-tabulation of predictions vs actual values
confusion_matrix <- table(Actual = joined_data$high_use, Predicted = joined_data$predicted_high_use)

# Calculating the total proportion of inaccurately classified individuals (the training error)
training_error <- mean(joined_data$high_use != joined_data$predicted_high_use)

# Determining the most frequent class in the actual data
most_frequent_class <- names(which.max(table(joined_data$high_use)))

# Calculating the error rate for the guessing strategy
guessing_strategy_error <- mean(joined_data$high_use != most_frequent_class)

# Printing the confusion matrix, training error, and guessing strategy error
print(confusion_matrix)
##        Predicted
## Actual  FALSE TRUE
##   FALSE   238   21
##   TRUE     70   41
print(paste("Training error: ", training_error))
## [1] "Training error:  0.245945945945946"
print(paste("Guessing strategy error: ", guessing_strategy_error))
## [1] "Guessing strategy error:  0.3"
# Creating a graphic visualization of actual values vs predictions
ggplot(joined_data, aes(x = as.factor(high_use), fill = as.factor(predicted_high_use))) +
  geom_bar(position = "fill") +
  labs(title = "Actual vs Predicted High Alcohol Use", x = "Actual High Use", y = "Proportion", fill = "Predicted") +
  scale_y_continuous(labels = scales::percent_format())

Interpretations

  • Confusion Matrix: The confusion matrix shows that the model has far more true negatives (238) than true positives (41), indicating that it is better at identifying individuals with low alcohol use than high alcohol use. There is also a considerable number of false negatives (70), where the model predicted low alcohol use for students who actually reported high use. This could be an area to focus on for improving the model.
  • Training Error: The training error of approximately 0.246 indicates that about 24.6% of the predictions made by the model on the training data were incorrect. While this error rate is not trivial, it suggests that the model has learned some patterns from the data. It’s important to consider this error rate in the context of the complexity of human behavior and the difficulty of predicting alcohol use.
  • Guessing Strategy Error: The guessing strategy error rate is 0.3, which means that simply guessing the most frequent class (in this case, low alcohol use) would result in an incorrect prediction 30% of the time. My model’s training error is lower than the guessing strategy error, indicating that the logistic regression model is providing more accurate predictions than a naive guess based on the most common outcome.
  • Bar Plot: The bar plot visualizes the proportions of predicted classes within each actual high alcohol use category. The large predicted-‘FALSE’ share in both actual categories indicates that my model is conservative, predicting ‘FALSE’ much more often than ‘TRUE’.
  • Model vs. Guessing Strategy: When comparing my model to the guessing strategy, my model outperforms the naive approach of always guessing the most frequent class. This demonstrates that my model, while not perfect, does provide valuable insights and can be a useful tool for understanding factors that contribute to high alcohol use among students.
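To quantify the observations above, the class-wise rates can be computed directly from the counts (hard-coded here from the printed confusion matrix):

```r
# Sensitivity, specificity and accuracy from the confusion matrix above
cm <- matrix(c(238, 21,
                70, 41),
             nrow = 2, byrow = TRUE,
             dimnames = list(Actual = c("FALSE", "TRUE"),
                             Predicted = c("FALSE", "TRUE")))
sensitivity <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])     # true positive rate
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ])  # true negative rate
accuracy    <- sum(diag(cm)) / sum(cm)                    # 1 - training error
round(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy), 3)
```

The low sensitivity (about 0.37) versus high specificity (about 0.92) matches the observation that the model detects low-use students much better than high-use ones.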

10-fold cross-validation on my model (Bonus)

# Loading the necessary library for cross-validation
library(boot)

# Defining the logistic regression model formula
model_formula <- high_use ~ goout + studytime

# Creating the glm model for cross-validation
glm_model <- glm(model_formula, data = joined_data, family = "binomial")

# Performing 10-fold cross-validation using the cv.glm function
set.seed(123)  # for reproducibility
cv_results <- cv.glm(joined_data, glm_model, K = 10)

# Printing the cross-validation results
print(cv_results)
## $call
## cv.glm(data = joined_data, glmfit = glm_model, K = 10)
## 
## $K
## [1] 10
## 
## $delta
## [1] 0.1750966 0.1749537
## 
## $seed : .Random.seed state vector (626 integers) omitted for brevity
# Calculating the average prediction error
cv_error <- cv_results$delta[1]

# Printing the average prediction error
print(paste("10-fold CV average prediction error: ", cv_error))
## [1] "10-fold CV average prediction error:  0.175096615513868"
# Comparing to the model error from the Exercise Set
exercise_set_error <- 0.26
print(paste("Is the cross-validated model better than the Exercise Set model? ", cv_error < exercise_set_error))
## [1] "Is the cross-validated model better than the Exercise Set model?  TRUE"

It seems that my model does have a smaller prediction error using 10-fold cross-validation compared to the model from the Exercise Set. Yay!
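One caveat worth noting (my own addition, not from the exercise): by default, cv.glm’s delta is the average squared error between the observed 0/1 outcome and the predicted probability (a Brier-type score), not a misclassification rate, so comparing 0.175 directly with a 0.26 classification error is not quite apples to apples. Passing a 0/1 cost function yields a cross-validated misclassification error instead. Sketched below on built-in mtcars data, since joined_data isn’t available in this chunk:

```r
library(boot)

# 0/1 misclassification cost: a prediction counts as wrong if the predicted
# probability falls on the wrong side of 0.5 (this is the cost form shown in ?cv.glm)
cost_01 <- function(actual, prob) mean(abs(actual - prob) > 0.5)

# Illustration on built-in data; the analysis above would use joined_data and glm_model
m <- glm(am ~ hp + wt, data = mtcars, family = "binomial")
set.seed(123)
cv_brier <- cv.glm(mtcars, m, K = 5)$delta[1]                 # default squared-error cost
set.seed(123)
cv_miss  <- cv.glm(mtcars, m, cost = cost_01, K = 5)$delta[1] # misclassification rate
c(brier = cv_brier, misclassification = cv_miss)
```

With the 0/1 cost, the cross-validated error is directly comparable to the training and guessing-strategy errors computed earlier.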


Chapter 4: Boston Housing Data Analysis

In this chapter, we explore the Boston Housing dataset, which comprises various attributes of housing areas around Boston, Massachusetts, as recorded in the 1970s. It’s a rich dataset often used for understanding the housing market through statistical learning techniques. As per the assignment, I begin by loading the data and examining its structure, highlighting key variables like crime rates, property tax rates, and median home values. Next, I provide a visual and statistical summary, discussing each variable’s distribution and interrelationships. Then, I standardize the data to prepare for more complex analyses, including clustering and linear discriminant analysis (LDA), which reveal insights into the socio-economic patterns affecting housing values.

Through this assignment, I’ve learned new statistical learning techniques. I gained insights into housing market patterns by performing exploratory data analysis, standardization, clustering, and discriminant analysis, and enhanced my data visualization skills further.

date()
## [1] "Mon Nov 20 17:51:21 2023"
# Loading and Exploring the Boston Dataset
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Graphical Overview of the Data

Next, I’m going to use ‘pairs’ to create pairwise scatter plots for an overview of relationships, and ‘summary’ to give a statistical summary of each variable.

pairs(Boston)

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Distributions of the Variables

Relationships Between Variables

  • rm and medv: A positive correlation suggests that suburbs with more rooms tend to have higher median home values.
  • lstat and medv: A visible negative correlation implies that suburbs with a higher percentage of lower-status population have lower home values.
  • nox and indus: A positive correlation indicates that more industrial areas have higher nitric oxide concentrations.
  • dis and nox: A negative correlation suggests that areas further from employment centers have lower concentrations of nitric oxides.
  • age and nox: There seems to be a trend where older houses are in areas with higher nitric oxide concentrations.
  • rad and tax: A high correlation indicates that suburbs with better highway access tend to have higher tax rates.
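These pairwise impressions read off the scatter plots can be checked numerically with the correlation matrix (a quick sanity check I added):

```r
library(MASS)
data("Boston")

# Correlations for the variable pairs discussed above
vars <- c("rm", "lstat", "nox", "indus", "dis", "age", "rad", "tax", "medv")
round(cor(Boston[, vars]), 2)
```

For example, cor(rad, tax) is about 0.91 and cor(dis, nox) about -0.77, confirming the strongest of the relationships listed above.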

Standardization and Categorical Variable Creation

# Installing and loading the caret package
if (!require(caret)) {
  install.packages("caret")
  library(caret)
}
## Loading required package: caret
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# Standardizing the dataset
scaled_Boston <- scale(Boston)

# Printing out summaries of the scaled data
summary(scaled_Boston)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# Creating a categorical variable of the crime rate using quantiles
Boston$crime_cat <- cut(Boston$crim, breaks=quantile(Boston$crim, probs=seq(0, 1, by=0.25)), include.lowest=TRUE)

# Dropping the old crime rate variable from the dataset
Boston <- Boston[,-which(names(Boston) == "crim")]

# Dividing the dataset into train and test sets (80% train, 20% test)
trainIndex <- createDataPartition(Boston$crime_cat, p = .8, list = FALSE, times = 1)
train_set <- Boston[trainIndex, ]
test_set <- Boston[-trainIndex, ]

Linear Discriminant Analysis (LDA)

# Loading the MASS package for LDA
library(MASS)

# Fitting the LDA model using the categorical crime rate as the target variable
lda_fit <- lda(crime_cat ~ ., data = train_set)

# Summarizing the LDA fit
lda_fit
## Call:
## lda(crime_cat ~ ., data = train_set)
## 
## Prior probabilities of groups:
## [0.00632,0.082]   (0.082,0.257]    (0.257,3.68]       (3.68,89] 
##       0.2512315       0.2487685       0.2487685       0.2512315 
## 
## Group means:
##                        zn     indus       chas       nox       rm      age
## [0.00632,0.082] 34.867647  4.389902 0.03921569 0.4482049 6.618569 42.48235
## (0.082,0.257]    9.386139  9.156832 0.05940594 0.4887515 6.224386 57.50495
## (0.257,3.68]     2.415842 12.474653 0.12871287 0.5974059 6.357050 79.36832
## (3.68,89]        0.000000 18.114510 0.06862745 0.6821078 6.013686 91.23627
##                      dis       rad      tax  ptratio    black     lstat
## [0.00632,0.082] 5.699057  3.490196 284.4510 17.52059 391.9486  6.928235
## (0.082,0.257]   4.621321  4.792079 322.8119 18.27822 385.2170 11.579208
## (0.257,3.68]    3.041023  6.009901 355.9208 17.78119 362.6541 12.779703
## (3.68,89]       2.011357 23.813725 663.4216 20.14608 294.5307 18.627157
##                     medv
## [0.00632,0.082] 27.76961
## (0.082,0.257]   22.87822
## (0.257,3.68]    24.26337
## (3.68,89]       16.03627
## 
## Coefficients of linear discriminants:
##                   LD1           LD2           LD3
## zn       0.0058638802  0.0336333261  -0.038838035
## indus    0.0097365728 -0.0770542250   0.071930089
## chas    -0.3058741802 -0.2575565730   0.151796895
## nox      3.2547006401 -4.9810251040 -11.481520459
## rm      -0.1253415623 -0.1113155702  -0.184244380
## age      0.0099843038 -0.0084262780  -0.007954078
## dis     -0.0470505232 -0.1668911643   0.156760508
## rad      0.3708158600  0.0936078668   0.028139720
## tax     -0.0003962842  0.0010585594   0.001539570
## ptratio  0.0693013006  0.0365044515  -0.129862507
## black   -0.0013254302  0.0004298002   0.001222392
## lstat    0.0345479590 -0.0358900241   0.046297776
## medv     0.0251083797 -0.0384011550  -0.024467292
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9483 0.0399 0.0119
# Plotting the LDA model
plot(lda_fit)

Predictions and Cross-Tabulation using LDA

# Saving the actual crime categories from the test set
actual_crime_categories <- test_set$crime_cat

# Removing the categorical crime variable from the test set
test_set <- test_set[,-which(names(test_set) == "crime_cat")]

# Using the LDA model to predict crime categories on the test set
predicted_crime_categories <- predict(lda_fit, newdata=test_set)$class

# Cross-tabulating the predicted vs actual crime categories
confusion_matrix <- table(Predicted = predicted_crime_categories, Actual = actual_crime_categories)

# Printing the confusion matrix
confusion_matrix
##                  Actual
## Predicted         [0.00632,0.082] (0.082,0.257] (0.257,3.68] (3.68,89]
##   [0.00632,0.082]              13             2            0         0
##   (0.082,0.257]                 8            18            8         0
##   (0.257,3.68]                  4             5           16         0
##   (3.68,89]                     0             0            1        25

Based on the confusion matrix above, we can evaluate the performance of the classification model as follows:

  • Low Crime Rate ([0.00632, 0.082]): Of the 25 areas actually in this category, 13 were classified correctly, while 8 were misclassified as medium-low crime ((0.082, 0.257]) and 4 as medium crime ((0.257, 3.68]).
  • Medium-Low Crime Rate ((0.082, 0.257]): 18 of the 25 instances were correctly predicted, with 2 misclassified as low crime and 5 as medium crime.
  • Medium Crime Rate ((0.257, 3.68]): 16 of the 25 instances were predicted correctly, while 8 were misclassified as medium-low crime and 1 as high crime ((3.68, 89]).
  • High Crime Rate ((3.68, 89]): All 25 high-crime instances were correctly identified, and only a single medium-crime area was misclassified into this category.

Key Observations from the results:

  • The model is particularly effective at correctly identifying the high crime rate category.
  • There’s some confusion between the low and medium-low crime rate categories, as well as between medium-low and medium crime rate categories.
  • The model misclassifies only one non-high-crime area (a medium-crime one) as high crime, which might be particularly important if the goal is to accurately identify high-crime areas.
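The overall accuracy can be read off the cross-tabulation directly (counts hard-coded here from the printed matrix):

```r
# Overall test-set accuracy from the confusion matrix above
cm <- matrix(c(13,  2,  0,  0,
                8, 18,  8,  0,
                4,  5, 16,  0,
                0,  0,  1, 25),
             nrow = 4, byrow = TRUE)
accuracy <- sum(diag(cm)) / sum(cm)  # (13 + 18 + 16 + 25) / 100
accuracy
```

So the LDA model places 72% of the test-set areas into the correct crime-rate quartile.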

K-Means Clustering

Next, I’m going to perform k-means clustering on the standardized Boston dataset to identify clusters within the data.

# Reload the Boston dataset
data("Boston")

# Standardizing the dataset
scaled_Boston <- scale(Boston)

# Calculating the distances between observations
distances <- dist(scaled_Boston)

# Installing and loading the factoextra package
if (!require(factoextra)) {
  install.packages("factoextra")
  library(factoextra)
}
## Loading required package: factoextra
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Determining the optimal number of clusters using the elbow method
set.seed(123)
fviz_nbclust(scaled_Boston, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) +
  labs(subtitle = "Elbow method")

# Running k-means with the determined optimal number of clusters
set.seed(123)
kmeans_result <- kmeans(scaled_Boston, centers = 4)

# Creating a data frame for plotting
plot_data <- as.data.frame(scaled_Boston)
plot_data$cluster <- factor(kmeans_result$cluster)

# Visualizing the clusters using two variables from the dataset
ggplot(plot_data, aes(x = rm, y = lstat)) +
  geom_point(aes(color = cluster)) +
  labs(color = 'Cluster')

  1. Cluster Distribution: The plot shows how the observations are grouped into four different clusters. Each cluster is represented by a different color.
  2. Cluster Characteristics:
  • Cluster 1: Characterized by lower rm values and higher lstat values, indicating smaller houses in areas with a higher proportion of lower-status population.
  • Cluster 2: Moderate rm and lstat values, suggesting houses with an average number of rooms and a moderate lower-status population.
  • Cluster 3: Higher rm values and moderate to low lstat values, indicating larger houses with a lower proportion of lower-status population.
  • Cluster 4: Moderate to high rm values but very low lstat values, suggesting these areas have larger houses and very low lower-status population proportions.
  3. Correlation Inference: There appears to be a negative correlation between rm and lstat, as expected. Areas with more rooms tend to have a lower percentage of lower-status population.
  4. Cluster Separation: The separation between clusters indicates how distinct the groups are based on the two variables used. For example, Cluster 4 is well separated from the others, suggesting that areas with larger houses and very low lower-status proportions are quite distinct from other areas.
  5. Outliers: Any points that are far away from their cluster centers might be considered outliers. For instance, any points in Cluster 1 that are far into the region of Cluster 3 could be unusual observations worth further investigation.
  6. Potential for Further Analysis: The clustering suggests that there may be underlying patterns in the Boston housing data related to room size and socio-economic status. These patterns could be explored further with additional socio-economic variables, or by looking at how these clusters relate to other outcomes like median home values.
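To make the cluster characteristics above concrete, the cluster means of rm and lstat can be reported back in original (unscaled) units. Note that cluster numbering depends on the random seed, so the numbers below may not map one-to-one onto the colors in my plot:

```r
library(MASS)
data("Boston")
scaled_Boston <- scale(Boston)

set.seed(123)
km <- kmeans(scaled_Boston, centers = 4)

# Mean rooms (rm) and lower-status share (lstat) per cluster, in original units
cluster_means <- aggregate(Boston[, c("rm", "lstat")],
                           by = list(cluster = km$cluster), FUN = mean)
cluster_means
```

Reading the means alongside the scatter plot makes it easier to match each color to a socio-economic profile.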

Bonus

library(MASS)
data("Boston")

# Standardizing the dataset
scaled_Boston <- scale(Boston)

Next, performing the K-means clustering

set.seed(123) # for reproducibility
# According to the assignment, a reasonable number of clusters is more than two; here I'm choosing 4
kmeans_result <- kmeans(scaled_Boston, centers = 4)

Now, performing LDA using clusters as target classes

# Adding the cluster assignments as a factor to the Boston data
Boston$cluster <- factor(kmeans_result$cluster)

# Fitting LDA model using the clusters as target classes
library(MASS) # for LDA
lda_fit <- lda(cluster ~ ., data = Boston)

Finally, visualizing the results

# Plotting the LDA projections (plot.lda shows the group scatter on the discriminants, not a true biplot with variable arrows)
plot(lda_fit)

# Examining the model's coefficients
lda_fit$scaling
##                  LD1          LD2          LD3
## crim    -0.033252529  0.093706888 -0.003875224
## zn      -0.003852921  0.005772233 -0.047264096
## indus   -0.107232590 -0.066745602 -0.041117990
## chas     0.068548580 -0.197356300  0.410601305
## nox     -7.766232173 -5.409634182 -7.673642525
## rm       0.139330592  0.341290080 -0.794324360
## age      0.003903578 -0.004331691  0.015529903
## dis     -0.001695810 -0.034881757 -0.176213368
## rad     -0.063928129 -0.015823030 -0.007533964
## tax     -0.004651926 -0.002905580 -0.003796952
## ptratio -0.159544007 -0.041917959 -0.038208490
## black    0.008358645 -0.020444688 -0.004007031
## lstat   -0.028069814  0.013570317 -0.056673688
## medv     0.005932882  0.013682649 -0.086102624

The LDA scaling coefficients reveal that nox (nitric oxides concentration) is the predominant variable influencing the separation of clusters on the first discriminant (LD1), indicating its strong role in differentiating the clusters. rm (average number of rooms) emerges as the most significant for the second discriminant (LD2), suggesting its importance in further distinguishing between clusters. On the third discriminant (LD3), chas (proximity to Charles River) has the highest coefficient, highlighting its influence in cluster separation at this level. These variables—nox, rm, and chas—are therefore the most critical linear separators for the clusters, with their varying scales and units contributing to their discriminant weights.
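Since the coefficients above come from unscaled predictors, a quick check (my own addition, not part of the assignment) is to refit the LDA on the standardized data: if nox still dominates LD1 after scaling, its influence is not merely an artifact of its units.

```r
library(MASS)
data("Boston")
scaled_Boston <- scale(Boston)

set.seed(123)
km <- kmeans(scaled_Boston, centers = 4)

# LDA on standardized predictors, so coefficient magnitudes are directly comparable
lda_scaled <- lda(x = as.data.frame(scaled_Boston), grouping = factor(km$cluster))
round(lda_scaled$scaling, 3)
```

With standardized inputs, a large absolute coefficient genuinely reflects a variable's contribution to separating the clusters.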

Super-Bonus

Now, the goal is to project the Boston data using the LDA model’s scaling matrix and then visualize the projections in 3D.

library(dplyr)

# Using the full Boston data (minus the cluster column) as the predictors for 'lda_fit'
model_predictors <- Boston[, -which(names(Boston) == "cluster")]

# Checking the dimensions
dim(model_predictors)
## [1] 506  14
dim(lda_fit$scaling)
## [1] 14  3
# Matrix multiplication to get the projections
matrix_product <- as.matrix(model_predictors) %*% lda_fit$scaling
matrix_product <- as.data.frame(matrix_product)
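For what it's worth, MASS's predict.lda computes essentially the same projection, except that it centers the predictors first, so its scores differ from the manual matrix product only by a constant shift per column. A self-contained sketch (the seed is my own, for reproducibility):

```r
library(MASS)

data(Boston)
set.seed(123)  # hypothetical seed
Boston$cluster <- factor(kmeans(scale(Boston), centers = 4)$cluster)

lda_fit <- lda(cluster ~ ., data = Boston)

# predict() returns the (centered) discriminant scores in $x
proj <- predict(lda_fit)$x
dim(proj)  # one row per observation, one column per discriminant
```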

Now, onto the 3D visualization…

# Installing and loading the plotly package
if (!require(plotly)) {
  install.packages("plotly")
  library(plotly)
}
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Creating a 3D scatter plot using the LDA projections
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = Boston$cluster) %>%
  layout(scene = list(xaxis = list(title = 'LD1'),
                      yaxis = list(title = 'LD2'),
                      zaxis = list(title = 'LD3')))

Interpretation:

  • Cluster Delineation: The plot suggests that the clusters occupy distinct regions in the space defined by the LDA, although there is some overlap, especially between clusters 1 and 2. The distinctness of most clusters in the 3D space supports the separation achieved by the LDA.
  • Dimensionality Reduction: LDA has reduced the data to three dimensions; with four clusters there are at most three discriminants, so these axes capture all of the between-cluster variance.
  • Cluster Characteristics: Clusters 3 and 4 appear to be more spread along the LD2 and LD3 axes, while clusters 1 and 2 are more compact. The spread could indicate variability within the clusters concerning the underlying variables.
  • Influential Variables: While the plot doesn’t directly show the contribution of each variable, the spread and orientation of the clusters can be partially attributed to the most influential variables identified previously, such as nox, rm, and chas.
  • Comparison with Crime Classes: If compared to a similar plot colored by crime classes, one might observe whether clusters defined by socio-economic factors (like crime rate) align with those determined through unsupervised k-means clustering. Similarities might suggest a relationship between crime rates and the variables used for clustering, while differences could indicate that the clustering captures other aspects of the data.
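The last point can be checked directly with a cross-tabulation of clusters against crime classes. A sketch assuming quantile-based classes (the cut points, labels, and seed below are my own choices, not from the analysis above):

```r
library(MASS)

data(Boston)
set.seed(123)  # hypothetical seed
cl <- kmeans(scale(Boston), centers = 4)$cluster

# Quantile-based crime classes (breaks and labels are illustrative choices)
crime_class <- cut(Boston$crim,
                   breaks = quantile(Boston$crim),
                   include.lowest = TRUE,
                   labels = c("low", "med_low", "med_high", "high"))

# Rows: crime class, columns: k-means cluster; a strong diagonal-ish pattern
# would suggest the clusters align with crime levels
table(crime_class, cl)
```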

Chapter 5


Chapter 6


Chapter 7